Using Semantic Analysis to Classify Search Engine Spam
Abstract
Search engines have tried many techniques to filter out spam pages before they can appear on the query results page. In Section 2 we present a collection of current methods used to combat spam. In Section 3 we introduce a new approach to spam detection that uses semantic analysis of textual content. This approach combines a series of content analyzers with a decision tree classifier to determine whether a given webpage is spam. Section 4 discusses the implementation of our approach. Our architecture augments the search engine Lucene by adding a Java-based spam classifier. The spam classifier makes use of the WordNet word database and the machine learning library Weka to classify web documents as either spam or not spam. We describe the results of our work in Section 5 and present our conclusions and future work in Section 6.
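The pipeline the abstract outlines (content analyzers feeding a decision tree) can be illustrated with a minimal Java sketch. Everything here is an assumption for illustration: the class name `SpamSketch`, the two analyzers, and the thresholds are hypothetical, the features are simple lexical stand-ins rather than the WordNet-based semantic features the paper refers to, and the tree is hand-written rather than learned with Weka.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

public class SpamSketch {

    /** Lowercases the text and splits it on runs of non-alphanumeric characters. */
    public static String[] tokenize(String text) {
        String cleaned = text.toLowerCase().replaceAll("[^a-z0-9]+", " ").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split(" ");
    }

    /** Analyzer 1 (hypothetical): share of all tokens taken by the most frequent token. */
    public static double topTermShare(String[] tokens) {
        if (tokens.length == 0) return 0.0;
        Map<String, Integer> counts = new HashMap<>();
        int max = 0;
        for (String t : tokens) {
            // merge returns the updated count for this token
            max = Math.max(max, counts.merge(t, 1, Integer::sum));
        }
        return (double) max / tokens.length;
    }

    /** Analyzer 2 (hypothetical): number of distinct tokens divided by total tokens. */
    public static double distinctRatio(String[] tokens) {
        if (tokens.length == 0) return 1.0;
        return (double) new HashSet<>(Arrays.asList(tokens)).size() / tokens.length;
    }

    /**
     * A two-level decision tree with hand-picked thresholds (assumed values,
     * not learned from data): heavy repetition of one term, or a very small
     * vocabulary, is treated as a sign of keyword stuffing.
     */
    public static boolean isSpam(String text) {
        String[] tokens = tokenize(text);
        if (topTermShare(tokens) > 0.30) return true;
        return distinctRatio(tokens) < 0.40;
    }

    public static void main(String[] args) {
        // "cheap" is 5 of 10 tokens, so the first branch fires: prints true
        System.out.println(isSpam("cheap pills cheap pills cheap pills buy cheap pills cheap"));
        // Nine distinct tokens, no repetition: prints false
        System.out.println(isSpam("Search engines rank pages by relevance to the query"));
    }
}
```

In a fuller system the analyzer outputs would form a feature vector, and a learned classifier would replace the fixed thresholds in `isSpam`.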
Similar resources
Low cost page quality factors to detect web spam
Web spam is a major challenge for the quality of search engine results. It is very important for search engines to detect web spam accurately. In this paper we present 32 low-cost quality factors to classify spam and ham pages in real time. These features can be divided into three categories: (i) URL features, (ii) Content features, and (iii) Link features. We developed a classifier using Resi...
A Novel Approach for Combating Spamdexing in Web using UCINET and SVM Light Tool
Search Engine spam is a web page or a portion of a web page which has been created with the intention of increasing its ranking in search engines. Web spamming refers to actions intended to mislead search engines and give some pages higher ranking than they deserve. Anyone who uses a search engine frequently has most likely encountered a high ranking page that consists of nothing more than a bu...
Detecting Stealth Web Pages That Use Click-Through Cloaking
Search spam is an attack on search engines’ ranking algorithms that promotes spam links to top search rankings they do not deserve. Cloaking is a well-known search spam technique in which spammers serve one page to search-engine crawlers to optimize ranking, but serve a different page to browser users to maximize potential profit. In this experience report, we investigate a different and rela...
Spam Filtering using Contextual Network Graphs
This document describes a machine-learning solution to the spam-filtering problem. Spam filtering is treated as a text-classification problem in a very high-dimensional space. Two new text-classification algorithms, Latent Semantic Indexing (LSI) and Contextual Network Graphs (CNG), are compared to existing Bayesian techniques by monitoring their ability to process and correctly classify a series of...
Query expansion based on relevance feedback and latent semantic analysis
Web search engines are among the most popular tools on the Internet and are widely used by expert and novice users. Constructing an adequate query that best represents the user's information need is an important concern for web users. Query expansion is a way to reduce this concern and increase user satisfaction. In this paper, a new method of query expa...
Publication date: 2002